NSF PAR Search | NSF Public Access Repository

Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training

https://doi.org/10.1145/3620665.3640399

Chen, Hongzheng; Yu, Cody Hao; Zheng, Shuai; Zhang, Zhen; Zhang, Zhiru; Wang, Yida (April 2024, International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'2024))

Full Text Available
Efficient Memory Management for Large Language Model Serving with PagedAttention

https://doi.org/10.1145/3600006.3613165

Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph; Zhang, Hao; Stoica, Ion (October 2023, ACM)

Full Text Available
AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators

https://doi.org/10.1145/3494534

Sohrabizadeh, Atefeh; Yu, Cody Hao; Gao, Min; Cong, Jason (July 2022, ACM Transactions on Design Automation of Electronic Systems)

Adopting FPGA as an accelerator in datacenters is becoming mainstream for customized computing, but the fact that FPGAs are hard to program creates a steep learning curve for software programmers. Even with the help of high-level synthesis (HLS), accelerator designers still have to manually perform code reconstruction and cumbersome parameter tuning to achieve optimal performance. While many learning models have been leveraged by existing work to automate the design of efficient accelerators, the unpredictability of modern HLS tools becomes a major obstacle for them to maintain high accuracy. To address this problem, we propose an automated DSE framework, AutoDSE, that leverages a bottleneck-guided coordinate optimizer to systematically find a better design point. AutoDSE detects the bottleneck of the design in each step and focuses on high-impact parameters to overcome it. The experimental results show that AutoDSE is able to identify the design point that achieves, on the geometric mean, 19.9× speedup over one CPU core for MachSuite and Rodinia benchmarks. Compared to the manually optimized HLS vision kernels in Xilinx Vitis libraries, AutoDSE can reduce their optimization pragmas by 26.38× while achieving similar performance. With less than one optimization pragma per design on average, we are making progress towards democratizing customizable computing by enabling software programmers to design efficient FPGA accelerators.
Full Text Available
Analysis and Optimization of the Implicit Broadcasts in FPGA HLS to Improve Maximum Frequency

https://doi.org/10.1145/3373087.3375332

Guo, Licheng; Lau, Jason; Chi, Yuze; Wang, Jie; Yu, Cody Hao; Chen, Zhe; Zhang, Zhiru; Cong, Jason (July 2020, Proceedings of the 57th Design Automation Conference (DAC 2020), San Francisco, CA)

Designs generated by high-level synthesis (HLS) tools typically achieve a lower frequency compared to manual RTL designs. In this work, we study the timing issues in a diverse set of realistic and complex FPGA HLS designs. (1) We observe that in almost all cases the frequency degradation is caused by the broadcast structures generated by the HLS compiler. (2) We classify three major types of broadcasts in HLS-generated designs, including high-fanout data signals, pipeline flow control signals and synchronization signals for concurrent modules. (3) We reveal a number of limitations of the current HLS tools that result in those broadcast-related timing issues. (4) We propose a set of effective yet easy-to-implement approaches, including broadcast-aware scheduling, synchronization pruning, and skid-buffer-based flow control. Our experimental results show that our methods can improve the maximum frequency of a set of nine representative HLS benchmarks by 53% on average. In some cases, the frequency gain is more than 100 MHz.
Full Text Available
From JVM to FPGA: Bridging Abstraction Hierarchy via Optimized Deep Pipelining

Cong, Jason; Wei, Peng; Yu, Cody Hao (July 2018, The 10th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 2018))

Full Text Available
Customizable Computing—From Single Chip to Datacenters

https://doi.org/10.1109/JPROC.2018.2876372

Cong, Jason; Fang, Zhenman; Huang, Muhuan; Wei, Peng; Wu, Di; Yu, Cody Hao (January 2019, Proceedings of the IEEE)

Full Text Available
Automated accelerator generation and optimization with composable, parallel and pipeline architecture

https://doi.org/10.1145/3195970.3195999

Cong, Jason; Wei, Peng; Yu, Cody Hao; Zhang, Peng (January 2018, DAC 2018)

Full Text Available
HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing

https://doi.org/10.1145/3289602.3293910

Lai, Yi-Hsiang; Chi, Yuze; Hu, Yuwei; Wang, Jie; Yu, Cody Hao; Zhou, Yuan; Cong, Jason; Zhang, Zhiru (January 2019, FPGA 2019)

Full Text Available
S2FA: an accelerator automation framework for heterogeneous computing in datacenters

https://doi.org/10.1145/3195970.3196109

Yu, Cody Hao; Wei, Peng; Grossman, Max; Zhang, Peng; Sarkar, Vivek; Cong, Jason (January 2018, DAC 2018)

Full Text Available
